Abstract

A thesaurus is one of many possible representations of term (or word)
connectivity. The terms of a thesaurus are seen as the nodes, and their
relationships as the links, of a directed graph. The directionality of the
links retains all the thesaurus information and allows the measurement of
several quantities. This has led to a new term classification according to the
characteristics of the nodes, for example, nodes with no incoming links, nodes
with no outgoing links, etc. Using an electronically available thesaurus, we
have obtained the incoming and outgoing link distributions. While the incoming
link distribution follows a stretched exponential function, the lower bound for
the outgoing link distribution has the same envelope as the scientific paper
citation distribution proposed by Albuquerque and Tsallis [1]. However, a
better fit is obtained by a simpler function, which is the solution of
Riccati's differential equation. We conjecture that this differential equation
is the continuous limit of a stochastic growth model of the thesaurus network.
We also propose a new way to arrange a thesaurus using the "inversion method".

Key words: complex networks, directed graphs, thesaurus 
PACS: 05.90.+m, 02.50.-r, 



Words are the building blocks used to construct sentences and to transmit
information. During the last decades much effort has been spent on the
statistics of words. Attention has centered on the similarities and differences
among word distributions, which may be useful for applications in automatic
information retrieval and thesaurus construction.

Zipf [2] has shown that word frequency obeys a power law if words are ranked
from the most to the least frequent. Statistical linguistics, at its lowest
level, can be exemplified by Zipf's exponent, which is very sensitive to the
writer's level of education but much less sensitive to language (cultural
characteristics). Beyond the word level, word connectivity has been treated in
several ways. These treatments include entropic measures [3] and the
construction of other quantities, such as the distribution of documents over
the frequency of words [4]. Another interesting way to treat data is Latent
Semantic Analysis (LSA) [5], which deals with word covariance in a corpus. LSA
is a principal component analysis (PCA) technique, i.e., the covariance matrix
is diagonalized and the eigenvectors associated with the most important
eigenvalues (around 300) are taken to span a Euclidean vector space. A curious
application of LSA is the automatic grading of high school texts [6]. However,
LSA has been criticized as a poor approach for predicting semantic
neighborhood [7].

Other studies have focused on a different approach, in which words are the
nodes of a graph and the relations between them are its links. Exhaustive
studies of thesauri [8,9] indicate that words are related among themselves as a
small-world and scale-free network [10]. This means that words may be embedded
in a low-dimensional space, but with a small fraction of long-distance
connections. The existence of the low-dimensional space has been suggested by
the deterministic "tourist" walks [11,12] on the graph, which is an independent
sampling procedure [13].

A thesaurus is a list of terms. A term can be a word, a compound word or even
an expression. The list of terms related to a main entry term (head-word)
provides alternatives for that entry. Following previous studies, we will
consider terms as being "words" in a broad sense.

As in a previous work [9], our study is based on an unstructured thesaurus, the
Moby Thesaurus II, which is the largest$^{1}$ and most comprehensive free
thesaurus data source available in English [14]. It has 30,260 (main) entries,
also called root words$^{2}$ or head-words$^{3}$, and 73,046 words that are
referred to from the entries but are not entries themselves; these are called
non-root words. Together they add up to 103,306 different words. Each root word
points, on average, to 83 words$^{4}$.

$^{1}$ The file has 24,271 KB.

$^{2}$ A root word should not be confused with a root node, which is defined as
one that has no incoming link.

$^{3}$ Some curiosities are: 877 words are not referred to from other entries,
and 16 words are entry words but point only to non-root words.

$^{4}$ Of which, on average, 54 are root words and 29 are non-root words.

The thesaurus-derived network is defined by considering each term as a node.
Connections are established from an entry to its list of related terms, forming
a directed graph. Terms can be classified by looking at the links (arcs); for
instance, head-words (root words) are words with at least one outgoing link
($k_{out} > 0$) and non-root words are words with no outgoing links
($k_{out} = 0$). Apparently there is a giant strong component (percolating
cluster of directed links) which connects a large fraction of the words
[15,16]. We stress that the working thesaurus is a simple and unstructured
related-term thesaurus. We point out the existence of other thesauri, such as
WordNet [17] and definition-term thesauri (Roget's thesaurus), which may be
modeled as a bipartite graph, but they are not considered here.
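
The construction described above can be sketched in a few lines. This is only
an illustration: it assumes the thesaurus is stored as one comma-separated line
per entry with the root word first (an assumption about the file layout), and
the function name `build_digraph` and the toy entries are hypothetical.

```python
from collections import defaultdict

def build_digraph(lines):
    """Return (out_links, k_in) for a thesaurus given as comma-separated lines.

    Assumed layout of each line: "root,term1,term2,..." (root word first).
    """
    out_links = {}
    k_in = defaultdict(int)
    for line in lines:
        fields = [f.strip() for f in line.split(",") if f.strip()]
        if not fields:
            continue
        root, related = fields[0], set(fields[1:])
        related.discard(root)          # drop self-links (assumption)
        out_links[root] = related
        for term in related:
            k_in[term] += 1            # one incoming link per referring entry
    return out_links, dict(k_in)

# Toy data, not taken from the Moby Thesaurus.
toy = ["big,large,huge,great", "large,big,huge", "great,big,grand"]
out_links, k_in = build_digraph(toy)

root_words = {w for w, rel in out_links.items() if rel}    # k_out > 0
all_words = set(out_links) | set(k_in)
non_root_words = all_words - root_words                    # k_out = 0
mean_k_out = sum(len(out_links[w]) for w in root_words) / len(root_words)
```

On the real file, `len(root_words)`, `len(non_root_words)` and `mean_k_out`
would be the quantities to compare with the 30,260 entries, 73,046 non-root
words and the average of 83 quoted in the text.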

If only co-linked terms (mutually referred terms) are considered, this structure
forms a digraph and reduces to the previously studied one, whose small-world
and scale-free structure has been pointed out [9]. In this case, the number of
connections $k$ of a node is called the degree of the node. The node degree
statistics show exponential behavior for small values of $k$ and power-law
behavior for large values of $k$ [9].
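
As a hedged illustration of this reduction, the sketch below keeps only the
mutually referred terms and counts the resulting degree $k$ of each entry. The
function name `mutual_degree` and the toy data are hypothetical; the input is
assumed to be a mapping from each entry to the set of terms it points to.

```python
def mutual_degree(out_links):
    """Degree of each entry when only mutually referred terms are kept."""
    degree = {}
    for word, related in out_links.items():
        # A link word -> t is kept only if t is an entry that points back.
        mutual = {t for t in related if word in out_links.get(t, ())}
        degree[word] = len(mutual)
    return degree

# Toy data, not taken from the Moby Thesaurus.
toy_links = {
    "big": {"large", "huge", "great"},
    "large": {"big", "huge"},
    "great": {"big", "grand"},
}
degree = mutual_degree(toy_links)
```

Non-root words never appear in the result, since they have no outgoing links
and therefore cannot be mutually linked.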

The directed-graph concepts permit us to classify sets of terms according to
their link properties as follows:

sink: composed of the 73,046 terms with $k_{out} = 0$. For example: glucose,
password, all-around, grape juice, send word, put to, lap dog, afterbirth;

source: the 30,260 terms with at least one outgoing link ($k_{out} > 0$),
usually called main entries, entries, head-words or root words. The source can
be divided into three categories:

  absolute source: the 877 terms without incoming links ($k_{in} = 0$).
  For example: rackets, grammatical, double quick, half moon, blinded;

  normal source: the 29,333 terms that receive links and send links to other
  source and sink terms ($k_{out} > 0$ and $k_{in} > 0$). For example:
  ablation, analogy, call out, factitious, laid low, make a deal;

  bridge source: the 16 terms without outgoing links to source terms
  ($k_{out}(\mathrm{source}) = 0$), namely: androgyny, Christian sectarians,
  Congress, detector, electric meter, enzyme, Esperanto, et cetera, Geiger
  counter, ghetto dwellers, harp, in fun, lobotomy, penicillin, perversely,
  Senate.

These definitions are illustrated by the subgraphs 1-4 in Figure 1. 
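
The classification above can be sketched as follows. This is a minimal
illustration under stated assumptions: the function name `classify_terms` is
hypothetical, self-links are ignored, and a "normal" source is taken to be one
that receives links and is not a bridge (the text does not say how overlapping
cases are resolved).

```python
def classify_terms(out_links):
    """Split the terms of a directed thesaurus graph into the classes above.

    out_links maps each entry to the collection of terms it points to.
    """
    # Ignore self-links (assumption).
    links = {e: set(r) - {e} for e, r in out_links.items()}
    k_in = {}
    for related in links.values():
        for term in related:
            k_in[term] = k_in.get(term, 0) + 1
    terms = set(links) | set(k_in)
    source = {t for t in terms if links.get(t)}               # k_out > 0
    sink = terms - source                                     # k_out = 0
    absolute = {t for t in source if k_in.get(t, 0) == 0}     # k_in = 0
    bridge = {t for t in source if not (links[t] & source)}   # k_out(source) = 0
    # Assumption: "normal" means receiving links and not being a bridge.
    normal = {t for t in source if k_in.get(t, 0) > 0} - bridge
    return {"sink": sink, "source": source, "absolute": absolute,
            "normal": normal, "bridge": bridge}

# Toy graph, not taken from the Moby Thesaurus.
toy_graph = {"a": {"b", "c"}, "b": {"a", "x"}, "c": {"x", "y"}, "d": {"a"}}
classes = classify_terms(toy_graph)
```

In the toy graph, `d` is an absolute source (nothing points to it), `c` is a
bridge source (it points only to sink terms), and `x` and `y` form the sink.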

Fig. 1. Coarse-grained view of the thesaurus as a directed graph. The region
composed of subgraphs 1 to 3 is the source and subgraph 4 is referred to as the
sink. The source contains: the normal source, named subgraph 1; the bridge
source, named subgraph 2; and the absolute source, here called subgraph 3.

It is interesting to observe that the outgoing link distribution can be fitted
by the scientific-paper citation frequency curve [1]:

$$ f(k_{out}) = \frac{N}{[1 + (q-1)\lambda k_{out}]^{q/(q-1)}} , \qquad (1) $$

with $N = 654 \pm 7$, $\lambda = (1.66 \pm 0.01) \times 10^{-2}$ and
$q = 0.95 \pm 0.01$ ($r^2 = 0.993$ and $\chi^2 = 77$); see Fig. 2. However, its
limiting behavior for small $k_{out}$ is exponential instead of the measured
stretched exponential. So, we propose another fitting function, with power-law
behavior for large $k_{out}$ and stretched-exponential behavior for small
$k_{out}$$^{5}$:

$$ f(k_{out}) = \frac{N}{1 + \lambda k_{out}^{\kappa}} , \qquad (2) $$

with $N = 468 \pm 3$, $\lambda = (2.0 \pm 0.4) \times 10^{-5}$ and
$\kappa = 2.55 \pm 0.03$ ($r^2 = 0.990$ and $\chi^2 = 99$); see Fig. 2.
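
To make the two fitting functions concrete, the sketch below evaluates them
with the central values of the fitted parameters quoted above (uncertainties
are ignored). The functional forms are written as read here, Eq. 1 as
$N/[1+(q-1)\lambda k]^{q/(q-1)}$ and Eq. 2 as $N/(1+\lambda k^{\kappa})$; the
function names are hypothetical, and the cutoff handling for $q < 1$ is an
assumption of this sketch, not something stated in the text.

```python
def f_citation(k, N=654.0, lam=1.66e-2, q=0.95):
    """Eq. 1: N / [1 + (q - 1) * lam * k] ** (q / (q - 1))."""
    base = 1.0 + (q - 1.0) * lam * k
    if base <= 0.0:
        # For q < 1 the bracket vanishes at k = 1 / ((1 - q) * lam);
        # the curve is set to zero beyond that point (assumption).
        return 0.0
    return N / base ** (q / (q - 1.0))

def f_riccati(k, N=468.0, lam=2.0e-5, kappa=2.55):
    """Eq. 2: N / (1 + lam * k ** kappa)."""
    return N / (1.0 + lam * k ** kappa)

# Tabulate both curves on a few values of k_out.
table = [(k, f_citation(k), f_riccati(k)) for k in (1, 10, 100, 1000)]

# For lam * k**kappa >> 1, Eq. 2 approaches the power law (N / lam) * k**(-kappa).
k_large = 1.0e4
power_law_ratio = f_riccati(k_large) / ((468.0 / 2.0e-5) * k_large ** -2.55)
```

Both curves equal their prefactor $N$ at $k_{out} = 0$ and decrease
monotonically; `power_law_ratio` is close to 1, which illustrates the power-law
tail of Eq. 2.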

For $\lambda k_{out}^{\kappa} \ll 1$ this curve behaves as a stretched
exponential [19]: $f(k_{out}) = N \exp[-(k_{out}/\bar{k}_{out})^{\kappa}]$,
which permits us to estimate the mean number of outgoing links as
$\bar{k}_{out} = \lambda^{-1/\kappa} = 70 \pm 6$, to be compared with the value
83 obtained from the thesaurus statistics. For
$\lambda k_{out}^{\kappa} \gg 1$, the distribution of $k_{out}$ is a power law:
$f(k_{out}) = (N/\lambda) k_{out}^{-\kappa}$.
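
The two limits can be checked with a short worked derivation. It is a sketch
that assumes Eq. 2 has the form $N/(1+\lambda k_{out}^{\kappa})$ and uses
$1/(1+x) \approx e^{-x}$ for $x \ll 1$, with the central values of the fitted
parameters:

```latex
\begin{align*}
f(k_{out}) &= \frac{N}{1+\lambda k_{out}^{\kappa}}
  \approx N\,e^{-\lambda k_{out}^{\kappa}}
  = N \exp\!\left[-\left(\frac{k_{out}}{\bar{k}_{out}}\right)^{\kappa}\right],
  \quad \bar{k}_{out} = \lambda^{-1/\kappa},
  && \lambda k_{out}^{\kappa} \ll 1, \\
f(k_{out}) &\approx \frac{N}{\lambda}\,k_{out}^{-\kappa},
  && \lambda k_{out}^{\kappa} \gg 1, \\
\bar{k}_{out} &= \left(2.0\times10^{-5}\right)^{-1/2.55} \approx 70 .
\end{align*}
```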

Fig. 2. The frequency of outgoing links $k_{out}$ (root words) is well
described by Eq. 2, which is the rightmost curve, in contrast with the curve of
Eq. 1. The point [$k_{out} = 17$, $f(k_{out}) = 3$] has been excluded from both
fitting procedures. These words are: for good, for keeps, and grin.

$^{5}$ Note that by introducing a new parameter one can write
$f(x) = N_q/[1 + \lambda x^{\kappa}]^{q/(q-1)}$, which is the Burr XII
distribution function that appears as a result of a $q$-logarithm entropy
maximization [18] and generalizes both functions.

On the other hand, we show in Figure 3 that the frequency of words with a
given number of incoming links ($k_{in}$) is very well described by the
stretched exponential curve

$$ f(k_{in}) = N \exp[-(k_{in}/\bar{k}_{in})^{\kappa}] , \qquad (3) $$

where we have found $N = 12000 \pm 300$, $\bar{k}_{in} = 4.9 \pm 0.3$ and
$\kappa = 0.52 \pm 0.01$ ($r^2 = 0.993$ and $\chi^2 = 4.58$). We stress that a
fitting curve of the type of Eq. 2 also describes these data if $\lambda$ is
taken small enough. A simple approximation may be used:
$f(k_{in}) \propto \exp(-\sqrt{k_{in}})$. Low values of incoming links
($k_{in} < 10$) are dominated by non-root words, while high values
($k_{in} > 100$) are dominated by root words, as seen in Figure 3.

Although empirically $f(k_{in})$ and $f(k_{out})$ are apparently different,
this may be due to a finite-size effect of the database. This is suggested by a
$k_{in} \times k_{out}$ plot (Fig. 4), where $k_{in}$ and $k_{out}$ are ranked
by decreasing values and plotted jointly to show the correlation between them.
From Fig. 4, it is clear that a linear correlation
occurs for k > 100. A perfect thesaurus should have a symmetric property 
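
A minimal sketch of how such a rank-rank comparison could be prepared: both
lists of link counts are sorted in decreasing order and paired rank by rank.
This is one reading of "ranked by decreasing values and plotted jointly"; the
function name `rank_rank_pairs` and the toy numbers are hypothetical.

```python
def rank_rank_pairs(k_in_values, k_out_values):
    """Pair the r-th largest k_in with the r-th largest k_out."""
    ins = sorted(k_in_values, reverse=True)
    outs = sorted(k_out_values, reverse=True)
    # zip stops at the shorter list, so unmatched low ranks are dropped.
    return list(zip(ins, outs))

# Toy numbers, not taken from the Moby Thesaurus.
pairs = rank_rank_pairs([5, 120, 1, 300], [250, 3, 110])
```

Plotting the second element of each pair against the first gives a curve of
the kind shown in Fig. 4.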



Fig. 3. Frequency of incoming links $k_{in}$ for all words (•), root words (△)
and non-root words (□). The curve for all words (•) is well described by a
stretched exponential (line) given by Eq. 3 ($N = 12000 \pm 300$,
$\bar{k}_{in} = 4.9 \pm 0.3$ and $\kappa = 0.52 \pm 0.01$), which is dominated
by non-root words for low $k_{in}$ values ($k_{in} < 10$) and by root words for
high $k_{in}$ values ($k_{in} > 100$).

Fig. 4. The numbers of links $k_{in}$ and $k_{out}$ are ranked by decreasing
values and plotted jointly to show the correlation.

As suggested by the above analysis, let Eq. 2 represent both the distribution
of outgoing and incoming links. If one takes the variable to be continuous,
it is not hard to notice that Eq. 2 is the solution of Riccati's differential
equation$^{6}$

$$ y'(x) = -\,\cdots \qquad (4) $$
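
As a hedged consistency check (not necessarily the form in which the authors
write Eq. 4), differentiating Eq. 2, taken as $y(x) = N/(1+\lambda x^{\kappa})$
with continuous $x$, gives a Riccati-type equation, i.e. a first-order equation
quadratic in $y$:

```latex
\begin{align*}
y(x) &= \frac{N}{1+\lambda x^{\kappa}}, \\
y'(x) &= -\frac{N\lambda\kappa\,x^{\kappa-1}}{\left(1+\lambda x^{\kappa}\right)^{2}}
       = -\frac{\lambda\kappa}{N}\,x^{\kappa-1}\,y^{2}(x)
       = -\frac{\kappa}{x}\,y(x)\left[1-\frac{y(x)}{N}\right].
\end{align*}
```

The last form uses $1 - y/N = \lambda x^{\kappa}/(1+\lambda x^{\kappa})$ and has
the logistic-like structure $y(1-y/N)$ familiar from contact processes.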



This equation is known to represent contact processes such as the propagation
of diseases [20]. We are currently searching for a microscopic network growth
model that has Riccati's equation as its continuum limit.

A thesaurus is an attempt to synthesize terms and their relationships as
naturally as possible. Nevertheless, this attempt is artificial and subjective.
Our treatment of the thesaurus as a directed graph has provided new insights
into its macrostructure. From this graph-theoretical approach, the counting of
$k_{in}$ and $k_{out}$ could lead to a novel proposal for arranging the terms
and their connectivities in it.

The standard thesaurus classification is made according to word frequency in
a corpus. Our approach instead suggests ranking the terms related to a given
root-word by their $k_{in}$ ranking or any other connectivity index. For a
finite thesaurus where the number of root-words is smaller than the total
number of terms, we suggest that it is always possible to construct a new
closed thesaurus by inversion between the initial $k_{out} > 0$ root-words and
$k_{in} > 0$ non-root words. The information in both cases is the same, but the
latter brings practical advantages, for instance, the fact that all words
present in the thesaurus become root-words (the thesaurus becomes closed). We
are submitting our closed version of the Moby Thesaurus as a freeware database
to the Gutenberg project [14].
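
The text does not spell out the inversion procedure, so the following is only
one possible reading, stated as an assumption: every non-root word receives an
entry of its own, listing the root words that refer to it, so that every word
present in the thesaurus becomes an entry. The function name `close_thesaurus`
and the toy data are hypothetical.

```python
def close_thesaurus(out_links):
    """Give each non-root term an entry listing the root words that refer to it.

    This is an assumed interpretation of the "inversion method", not the
    authors' documented procedure.
    """
    closed = {word: set(related) for word, related in out_links.items()}
    for root, related in out_links.items():
        for term in related:
            if term not in out_links:              # non-root word (k_out = 0)
                closed.setdefault(term, set()).add(root)
    return closed

# Toy data, not taken from the Moby Thesaurus.
toy_thesaurus = {"big": {"large", "huge"}, "large": {"big", "grand"}}
closed = close_thesaurus(toy_thesaurus)
is_closed = all(t in closed for related in closed.values() for t in related)
```

No link of the original thesaurus is discarded: the added entries only repeat
existing links in the reverse direction.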

The authors are deeply grateful to Vera Lucia Coelho Villar from the Instituto
Antonio Houaiss de Lexicografia, Brazil, for fruitful discussions. The authors
also thank F. Brouers, M. G. V. Nunes, B. C. D. da Silva and C. Tsallis for
stimulating discussions. This work has been partially funded by the Brazilian
agencies FAPESP, CAPES and CNPq.